fiZtr/v/Z/tJT 


Semi-Annual Report 
to 

NASA-Ames Research Center 
Grant Award No. NAG 2-304 

NASA Technical Officer: Karl Anderson 


for 


A Survey of the State of the Art and 
Focused Research in Range Systems - Task II 


Principal Investigator 
Kung Yao 


Electrical Engineering Department 
University of California 
Los Angeles, California 90024 


June 1986 


(NftSfi-CR- 176988) ft SURVEY OF THE STATE OF 

;the art and focused research in range 
I SYSTEMS, TASK 2 Semiannual Report 
(California Univ. ) 22 p CSCL 09C 

G3/33 


N86-30069 

THRU 

N86-3G072 

Uaclas 

43497 



In the period from January to June 1986, the following research activities 

and publications performed under NASA-Ames Research Contract Grant Award 

Number NAG-2-304 are enclosed: 

1. C. Y. Chang and K. Yao, "On Some Equivalent Configurations of Systolic 
Arrays," Proc. of the Twentieth Ann. Conf. on Information Sciences and 
Systems, Princeton, NJ, March 1986. 

2. S. Kalson and K. Yao, "Results in Least-Square Estimation Algorithms 
with Systolic Array Architectures," in Digital Communications, edited 
by E. Biglieri and G. Prati, Elsevier Science Press, 1986, pp. 235-249. 

3. K. Konstantinides and K. Yao, "Modeling and Equalization of Nonlinear 
Bandlimited Satellite Channels," Conf. Record of Inter. Conf. on Communi- 
cations, June 1986, pp. 1622-1626. 

4. M. J. Chen and K. Yao, "On Realizations of Least-Squares Estimation and 
Kalman Filtering by Systolic Arrays," Proc. of Inter. Workshop on Systolic 
Array, Oxford, England, June 1986. 



ORIGINAL W8SJ8 

OF POOR 


ON SOME EQUIVALENT CONFIGURATIONS 
OF SYSTOLIC ARRAYS 


-3> r 


C.Y. Chang and K. Yao 
Electrical Engineering Dept, 
Univeraity of California 
Los Angeles, CA 90024 


CDI ^ 0 ' 7 i?. 

N 8 6- WO 7 0 


ABSTRACT 

A systematic approach is presented for 
designing systolic arrays and their equivalent 
configurations for certain general classes of 
recursively formulated algorithms. A new method is 
also introduced to reduce the input bandwidth and 
storage requirements of the systolic arrays through 
the study of dependence among the input data. Many 
well known systolic arrays can be rederived and 
also many new 6y6tolic arrays can be discovered by 
this approach. 


I . INTRODUCTION 


A systolic array i6 a network of processors 
that rhythmically process and pass data among 
themselves. It provides pipelining, parallelism, 
and simple adjacent neighbor cell interconnection 
structure so that it is suitable for VLSI 
implementation. While mo6t of the earlier systolic 
array algorithms were discovered beuristically 
[1-3]. there has been various work on systematic 
approaches to the design of systolic array 
algorithms [4*6]. In this paper, ve shall present 
a systematic approach for designing systolic arrays 
and especially focus on their equivalent 
configurations for certain general classes of 
recursively formulated algorithms. In order to 
reduce the input bandwidth and storage requirements 
of the systolic arrays, the dependence among the 
input data is also investigated in details. It is 
6bovn that many well known systolic arrays can be 
rederived and also many new systolic arrays can be 
discovered by this systematic approach. For 
simplicity of illustration, we mainly consider the 
linear systolic array in this paper. The same idea 
can also be generalized to the two dimensional 
mesh-connected systolic arrays. 

II. IMPLEMENTATION OF RECURSIVELY 
FORMULATED ALGORITHMS 


Consider two simple but important vays of data 
flow pattern in a linear systolic array as shown in 
Figure 1 and 2. In these two figures, ?., Q., and 
b . . are three given input data sequences and^R. is 
to J be the output data sequence, where 0£i<m~l ind 
0<j<n-l. For the systolic array t ghown in Figure 
1. Q. and R. are stored in the j processor, where 
R. will be Updated while P. is moving to the right 
add b. . is moving down. For the systoli^array 
shown "in Figure 2, P. is stored in the i 
processor and R. wilt be updated as it is moving to 
the right with while b. . is moving down. All of 
the data movements are synchronized. The R.'s will 
successively have the required output data dfter m 
steps. For convenience, according to the R.'s 
behavior of these two systolic arrays, tbey^are 
respectively named as R-stay and R-move linear 
systolic arrays. There is great similarity betveen 
these two systolic arrays. It can be shown that a 
large class of interesting problems in the real 


world can be implemented by thebe two types of 
linear systolic arrays. Besides, various different 
but equivalent configurations of linear systolic 
arrays can also be derived from them. 


Procedure 1 : Given any problem which can be 

formulated so that it has P., Q., and b.. as three 
input data sequences and R.'as Che output! data 
sequence, where 0<i<jn-l anC 0<i<n-l. if R. can 
be generated through the following recurrence 
equation 


■ «v v v s / i3 >- 


( 1 ) 


where R. ^ contains some initial value, f is any 
functierLof four variables P., Q., b. ., and R. , 
and R. & is the required ouiput^data^R . , theil this 
problem can be implemented by the R-stap linear 
systolic array of n processors and the R-move 
linear systolic array of m processors. □ 


The complexity and the configuration of the 
systolic array depend on the complexity of the 
function f and the generation procedure of b... 
Some regularity and dependence among b . . •s mlp 
greatly simplify the whole system. J 


III. MAPPING INTO FAN-IN TYPE 
LINEAR SYSTOLIC ARRAY 


Note that for the two linear systolic arrays 
shown in Figure I and 2, -the input bandwidth and 
storage requirements are large in comparison to the 
number of processors in the array, which may be 
either infeasible or inefficient for many 
applications of interests. This is mainly because 
the dependence among the b..'s is not efficiently 
utilized so that each processor needs its own 
external input connection due to the existence of 
all the b^.'s. It is expected that under certain 
circumstances not all of these external input 
connections are required. In this paper, we are 
also very interested in the issue of reducing the 
input bandwidth and storage requirements by showing 
under what conditions these external input 
connections can be removed so that only the very 
first processor is allowed to have such a 
connection, i.e., the input sequences can only be 
fanned in through the systolic array. It i6 shown 
that the existence of certain patterns of 
dependence among the b..'s allows themselves to be 
fanned-in generated by Slightly modifying the 
operations involved in each processor without 
losing the property of adjacent neighbor 
interconnection structure. These conditions are 
shown in the following two procedures. 


Procedure 2 : For the R-stay linear systolic 

array, if b. . can be determined through the 
following dependence equation 


b. . = T(b. , b. . , ; u. ; 
ij i-l. j 1*3*1 i 


V j > * 


(2) 



vbere u, ii » variable vbicb depends only on i, v. 
is • variable vbicb depends only on j, and T is a J 
function of four variables, then b.. can be 
generated by the fan-in scheme systolic array as 
shown in Figure 3 rather than being broadcast as 
ehovn in Figure 1, Also note that b , • as veil as 
v., vbicb depends only on j, can be prlioaded in 
tile j processor, and b^ as veil as u^, vbicb 
depends only on i can be deed as a fanned-in input 
sequence, 0 

Mote that for tbe R-stay linear systolic array 
sbovn in Figure 1, if b.. is tbe current input to 
the j processor, then lL_j • is tbe previous- 
input to tbe j processor b. j is tbe 
previous input to tbe (j-l) S professor. It is 
understandable that in order to avoid the violation 
of tbe adjacent neighbor interconnection structure, 
b.. can only depend on b._j • and b. as veil as 
ttj data that can be preloai^d and data that 
can be fanned in, vbicb is vhat Procedure 2 is 
about. In general, tbe systolic array sbovn in 
Figure 3 has tvo sets of input data. One of them 
consists of three fanned-in data sequences, P., u., 
and b. . , vhich depend only on tbe i index, and 
tbe oifier set consists of three preloaded data 
sequences, Q., v. and b_j . , vbicb depend only on 
tbe j index, ^wbeie u-, v.J-^b. and b_j . are used 
to generate all the each professor, 

four registers are required, namely Q , V , B and 
R, vbere registers Q and V are used^to Store the 
preloaded data Q. anS v. respectively. Initially 
register, B.is loided as J b . and register R is set 
to be R. , both of vhicb 1 6^11 be updated as the 
systolii array start operation. The reason to 
include so many data sequences is to take care of 
tbe general cases. However, it is expected that in 
many applications, not all of these fanned-in and 
preloaded data sequences are required. It is often 
the case that tbe fan-in generation process of b. . 
simply depends on two or three data sequences vbicb 
can either be fanned-in or preloaded. Similarly 
for tbe R-move linear systolic array, very similar 
results can be obtained as follovs. 


Procedure 3 : For the R-move linear systolic 

array, if b.. can be determined through tbe 
following dependence equation 


b. . 

i-J 


T(b i-i.j ; b i.j-l { V V j ) * 


(3) 


vhere u. is a variable which depends only on i, v. 
is a variable which depends only on j, and T is a 2 
function of four variables, then b.. can be 
generated by tbe fan-in scheme sysijlic array as 
shown in Figure 4 rather than being broadcast as 
sbovn in Figure 2. Also note that b. , as veil as 
u., vj^ch depends only on i, can be pfeloaded in 
tie processor, and b_j . as veil as v., vbicb 
depends only on j, can be died as a fannei-in input 
sequence. 0 


Note that for tbe R-move linear systolic array 
shovn t jin Figure 2, if b^ . is tbe current input to 
the i proces|gr, tben^iL is the previous 
input to the i processor 'Jpa b^j . is tbe 
previous input to the (i-l) B proceSior. What 
procedure 3 says simply repeats tbe fact that in 
order to avoid the violation of adjacent neighbor 
interconnection structure, b. . can only depend on 
b. . . and b. . , as veil as 1 Phe data that can be 
prelApded an&’Pbe data that can be fanned in. In 
general, the systolic array sbovn in Figure 3 has 


tvo sets of input data. One of them consists of 
three fanned-in data sequences, Q., v., and b_j ., 
which depend only on tbe j index, J and J the other’set 
consists of three preloaded data sequences, P^» u^, 
and b- , vbicb depend only on tbe i index, vbere 
u.» v*S 6. , and b , . are used to generate all 
tie bi.'ai* For eacb 1 fPocessor, three registers are 
required, namely 0 , B and P, vbere registers P and 
U are used to stoFe the preloaded data P. and u.. 
Initially register B is, loaded as b. , and output 
data R. is set to be R. both of L $bxch will be 
updated as tbe systolii array start operation. 


Tbe previous three procedures provide a rather 
systematic approach to design the systolic array 
architecture for tbe implementation of a given 
problem. At first, by checking tbe existence of 
the recurrence relationship as shown in equation 
(1), ve are able to know if there exist any 
systolic arrays as sbovn in Figure 1 and 2. Next, 
by checking the dependence among the b..'s as sbovn 
in equations (2) and (3), ve are able know tbe 
existence of tbe fan-in type systolic arrays as 
shown in Figure 3 and 4 so that only small input 
bandwidth and storage are required. Tbe key issue 
is in how to search for the recurrence function f 
and tbe dependence function T. It is expected that 
there may exist several different forms of 
functions due to different possible approaches to 
formulate a given problem. Various forms of theBe 
functions simply create many different but 
equivalent configurations of systolic arrays. Also 
note that in tbe previous discussion, P, Q, b, u, 
and v are somewhat treated as single variables, 
however it ia clear that they can be set of 
variables and tbe same results still hold. This 
approach can be applied to design systolic arrays 
for many interesting problems in the real world. 
Various new configurations of systolic arrays can 
be derived. In the next section, ve shall 
illustrate this design approach by considering the 
DFT algorithm. 

IV. SYSTOLIC ARRAY ARCHITECTURE 

FOR DISCRETE FOURIER TRANSFORM 


Given n discrete data a. in tbe time domain, 

♦ X 

vhere 0<iv£n-l, and n discrete frequencies = 

( e i2 *7 n )j £ n tbe frequency domain, vhere 0<j<n-l, 
the discrete Fourier transform (DFT) is to compute 


a . V .®" 1 ♦ a V .®” 2 ♦ 
n-1 j n-2 j 


+ *! w j + V 


Let 


f(P, Q, b; R) = (R x b) * P. 


By induction, it can be shown that by letting 

»j tMJ * * V ♦ Vi-2 «> 

and = * n .j» then y.^® ^ = y., is tbe 

required output. Tbe existence of recurrence 
function f and tbe satisfaction of the recurrence 
relationship guarantee that there exists systolic 
arrays for tbe implementation of discrete Fourier 
transform as shown in Figure 5 and 6. 


It can be seen from Figure 5 and 6 that tbe 
b..'s are not totally independent. Note that P. = 
a l £. „ and b. . = W.. In order to see if b. . can be 
fanned-in generate j, let us examine the da U 
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'dependence among the b..'s. Many different forma 
of dependence function i exist. For example* 


b. . = T(b. . b. . u.j v.) 
ij _ x-1. J* i* 3 


(5) 


where v. = V.. The pair of systolic arrays based 
on equations'* (4) and (S) are shovn in Figure 7 and 
8. The systolic array ahovn in Figure 8 is the 
veil known systolic DFT [2] » whose discovery 
appears to be heuristic rather than in a systematic 
manner ss from our approach. For another example 
of T function, note that 


i.e.. 


b. . * V. = V, 


IJ - b J l 1 1 

>. . = T(b. . . ( b. . . { u. j v.) 

lJ = b x '*~ l 1 j 

i.j-1 i 


( 6 } 


where u^ = Vj and , vhich can be either 

used as fanned-in sequences of the R-gtay linear 
systolic array or preloaded in the i processor of 
the R-move linear systolic array. The pair of 
systolic arrays based on equations (4) and (6) are 
shown in Figure 9 and 10. 


Another interesting issue is thst the type of 
function f used in this example does not belong to 
the class of general matrix vector multiplication. 
This confirm the fact that the class of problems 
covered in the Procedure 1 really contains not only 
the class of general matrix vector multiplication. 
As well known, there are two different ways to 
consider the discrete Fourier transform. One shows 
that the OFT is a special case of the evaluation of 
a polynomial and the other shows that the OFT is a 
special case of general matrix vector 
multiplication. The first way was just considered 
in this example. Let us see what can be obtained 
by following the second way. Let 


f(P, Q. b; R) = R ♦ (P x b). 


By induction, it can be sbovn that by letting 



f and the satisfaction of the recurrence 
relationship guarantee that there exists systolic 
arrays for the implementation of OFT as shovn in 
Figure 11 and 12. 


From Figure 11 and 12 it can also be seen that 
the b..'s are not totally independent. Note that 
P^ - a^ and b.. = W. 1 . Let us examine the data 
dependence amoilg tbt b^.'s. Note that 


b. . = 
ij = 


i.e. . 


V . 1 = w.J 
bt . ,W X . 

i.J-1 1 


= w.J- 

1 


b. . = T(b. 


rhK 

i.j-i i 


b. 


i.j-1* 


v. V 

j-i i 




( 8 ) 


where u^ = and b^ , which can be either 

used as fanned-in sequences of the R-gtay linear 
systolic array or preloaded in the i processor of 
the R-move linear systolic array. The pair of 
systolic arrays based on equations (7) and (8) are 
ahovn in Figure 13 and 14. Also note that 


i.e. 


ij 


b. . 

ij 


V.‘ = V. 
3 

T(b. 


i-1. 


rt. 


b. 

1- 


i.iV 


b. . 


i* 


V 




( 9 ) 


where v . = W. and 


• fchl. 


= W 


-1 


which can be either 


preloadtd in^the j ftocestor of the R-atay linear 
systolic array or used as fanned-in sequences of 
the R-move linear systolic array. The pair of 
systolic arrays based on equations (7) and (9) are 
shown in Figure 15 and 16. 


This DFT example shows that under certain 
circumstances it is possible to formulate a given 
problem in several different ways to implement with 
various different but equivalent configurations of 
systolic arrays. 


V. CONCLUDING REMARKS 


A systematic approach is presented for 
designing systolic arrays and deriving their 
equivalent configurations for certain general 
classes of recursively formulated algorithms. This 
approach can be considered as a two-stage design 
procedure. In the first stage, the existence of 
recursiveness is investigated. If it exists, 
according to the same formulation the input data 
are classified into three parts, two of them, P. 
and Q., depend only on one index, and another one 
of thtm, namely b- . depends on both index i and j, 
so that the systolic arrays shown in Figure 1 and 2 
apply. However, for certain applications, it is 
either infeasible or inefficient to store all of 
the b..'s. In the second stage, the dependence 
among^he b..'s is then investigated to see if it 
can be used l t!o fan-in generate the b^.'s through 
the data sequence that can either be^reloaded or 
fanned in. For a given problem, various 
formulations of the recursive property and the 
dependence among the b..’s are possible, which 
simply lead to many different but equivalent 
configurations of systolic arrays. 

So f8r we mainly deal with the linear systolic 
arrays. However, the same technique can be easily 
generalised to the two dimensional mesh-connected 
systolic arrays, since the mesh-connected systolic 
arrays can be simply treated ss the concatenation 
of many linear systolic arrays. 
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Figure 1: The R-stay linear systolic 
array. 
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Figure 3: The fan-in scheme of R-stay 
linear systolic array. Note that the 
register B in the jt * processor is 
initially loaded with b-i,j. 
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Figure 5: R-stay linear systolic array of 
discrete Fourier transform based on 
equation (4). 


Figure 2 - The R-move linear systolic 
array. 
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Figure 4: The fan-in scheme of R-move 
linear systolic array. Note that the 
register B in the ith processor is 
initially loaded with bi,-i. 
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Figure 6: R-move linear systolic array of 
discrete Fourier transform based on 
equation (4). 



Figure 7: R-stay linear systolic array of 
discrete Fourier transform based on 
equations (4) and (5). 


Figure 8: R-move linear systolic array of 
discrete Fourier transform based on 
equations (4) and (5) 





























Figure 9- R-stay linear systolic array of 
discrete Fourier transform based on 
equations (4) and (6). 
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Figure 11: R-stay linear systolic array 
of discrete Fourier transform based on 
equation (7). 


Figure 10: R-move linear systolic array 
of discrete Fourier transform based on 
equations (4) and (6). Note that register 
Up is preloaded with Wi and register 
B is initially loaded with Wi - *. 



Figure 12 : R-move linear systolic array 
of discrete Fourier transform based on 
equation (7). 



Figure 13: R-3tay linear systolic array Figure 14: R-move linear systolic array 

of discrete Fourier transform based on of discrete Fourier transform based on 

equation (7) and (8). equations (7) and (8). Note that in the 

ith processor, register Up is preloaded 
with Wi and register B is initially 
loaded with Wi - *. 



Figure 15: R-stay linear systolic array 
of discrete Fourier transform based on 
equations (7) and (9). Note that in the 
jth processor, register Vp is preloaded 
with Wj and register B is initially 
loaded with Wj - * . 


Figure 16: R-move linear systolic array 
of discrete Fourier transform based on 
equations (7) and (9). 
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abstract 

In this paper we consider the problem of modeling and 
equalization of a nonlinear satellite channel. The channel is 
assumed to be bandlimited and exhibits both amplitude and 
phase nonlinearities A discrete time satellite link is modeled 
under both uplink and downlink white Gaussian noise. Under 
conditions of practical interest, a simple and computationally 
efficient design technique for the minimum mean square error 
linear equalizer is presented. The bit error probability and some 
numerical results for a BPSK system demonstrate that the 
proposed equalization technique outperforms standard linear 
receiver structures. 


I INTRODUCTION 

The problem of nonlinear channel modeling and equalization 
is of analytical and practical interest. An important example of a 
nonlinear channel is a digital satellite communication link, which 
uses a Traveling Wave Tube (TWT) amplifier operating in a 
near saturation region The TWT exhibits nonlinear distortion 
in both amplitude (AMAM conversion) and phase (AM/PM 
conversion). In addition, at high transmission rates the channel's 
finite bandwidth causes a form of distortion known as 
Intersymbol Interference (ISI). 

In this paper, we will examine the problem of modeling and 
equalizing this type of nonlinear satellite communication link. 
The observed data are corrupted by additive white noise, 
uncorrelated with the input data 

A number of other researchers have studied this problem. 
Fredncsson [lj. considered a QPSK system and specified an 
optimum linear receiver filter using a Mean Square Error (MSE) 
criterion. The channel nonlinearity in [1] was handled via 
successive number of linearizations Mesiya et al. [2-3] analyzed 
the BPSK system. In [2] a maximum likelihood receiver was 
considered, while in |3] a simpler linear receiver, based on the 
MSE criterion, was presented In both [2] and (3). the 
nonlinearity of the TWT is expressed in terms of Bessel function 
integrals. The MSE criterion was also applied by Bigiieri et al. 
[4] in their derivation of an optimum linear receiver. In [1], [3], 
and 1 4). the authors work in the frequency domain, and the 
solution is given in terms of integral equations that usually have 
to be solved using numerical techniques 

In [5], Ekanayake and Taylor presented a suboptimum 
maximum-likelihood type decision feedback receiver. However, 
because of the analytical complexity of their solution, they 
approximate the TWT with a hard limiter A different modeling 
approach was taken by Benedetto et al [6], First, they identify 
the whole channel using a Volte mi Series expansion [7], Then 
they suggest a nonlinear equalizer, based again on the MSE 


criterion. Although at the output of a nonlinear equalizer the 
MSE is smaller than the MSE at the output of a linear equalizer, 
it is not completely dear if there is a significant improvement in 
the probability of error performance of the system to justify the 
complexity of die nonlinear receiver. 

In this paper, we present the design and pe rf ormance analysis 
of the optimum linear MSE receiver for a nonlinear satellite 
channel. While the methods considered here are applicable to 
various in-phase and quadri-phase modulation systems, for 
simplicity and Lack of space we will illustrate this approach by 
using only BPSK examples. More generalized results will be 
presented elsewhere. There are two major differences between 
our design as compared to the above reviewed approaches. 
First, we use a very ample model for the input-output 
relationship of the TWT amplifier, proposed first by Saleh [8]. 
Second, by working in the discrete time domain we avoid the 
complex integral equations of the other approaches. In addition, 
a fast and simple iterative algorithm [9] permits the easy 
computation of the autocorrelation coefficients of the output of 
the nonlinear system. Thus, we are able to obtain a new simple 
and computationally efficient linear equalization technique under 
the MSE criterion. Based on the same modeling approach, a 
zero forcing type of linear equalizer was also presented in [10]. 

In Section 2, a simplified model for a typical satellite link is 
presented and the corresponding BPSK discrete model is 
derived The optimum MSE equalizer is presented in Section 3. 
In Section 4, the probability of error performance of the receiver 
is derived Finally in Section 5 some numerical examples, and 
comparisons with standard linear receivers are presented. 

n. CHANNEL MODELING 

Consider the simplified model of a digital satellite 
communication channel as shown in Figure 1. We will e x ami ne 
each one of the different subsystems composing this model. This 
study will enable us to derive an equivalent discrete model. By 
working in discrete time we will avoid the analytical problems 
arising with continuous signals. Our analysis is similar to that of 

Ekanayake and Tayloj [5]. 

The source output is a random sequence (U(d)} of equally 
probable uncorrelated symbols. Thus, in a BPSK system. 

U(n)=0-1) at n=0. T. 2T where P[U(n) == 1] = P[U(n) = 

-1] = 0.5, E[U(n)U(n-k)] = 0 for k » 0, and T ■' is the signaling 
rate. 

Let p(t) denote a pulse shaping function. Often it can be a 
rectangular function of unit amplitude over a time period of 
length T. In any case, the output of the modulator can be 
expressed in the form 
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*(»)■ £U(")pO-iiryuKu) C t, (i) 

where u, it the earner frequency. 

We (hell assume that the transmission filter is the one which 
determines the channel bandwidth. This filter is also responsible 
far the creation of IS. Let G(t) - 2g(t)cosu r r be the impulse 
response at this filter, where g(t) is the impulse response at a 
corresponding low pass filter. Then the output of this filter can 
be expressed as 

jjUin'M.-nT)^',. (2) 

where h(t)-g(t)*p(t). The purpose of our analysis is the design 
at a receiver structure for the estimation of the transmitted 
source symbol during the o tb signaling interval nT Sr s (n +1)7 . 
Thus during the n th signaling interval (2) can be rewritten as 

U(n)h(t-nT)cc6U't + £ U(i )h (t-iT)cosu> r / , (3) 

»iTst£(»i + l)T. 

The first term in (3) represents the transmitted symbol we want 
to estimate, and the second term represents the IS] due to the 
filter. 

On the uplink channel. r.(t) is corrupt ed by additive white 
Gaussian noise Thus, using the narrow band model for the 
noise, the input to the TWT can be expressed as 

0 ) * s. (r )+ n« (t )cosu> c r -n«, (r )sinu> r r. (4) 

iw(') and n„(i) represent the in-phase and quadrature 
components of the uplink noise, each with zero mean and 
variance a, 5 . From (3) and (4) 

*',(<)“ r.(i)cos(uj r f + X(r)). (5) 

where 



» 4 (t) 


rigor* 1: attaint* Ca— Scats— Hlta Hndal 

which are originally specified by the manufacturer. In Figure 2 
the P(r) and Q (r) functions of (10) are plotted for 
a, -2.0922,&, -1.2466, a, -5.529 and fi,«2.7088[8]. All 

input and output voltages were normalized. 

Because of the downlink additive white noise m(i), the 
received waveform r'a(») can be expressed as 

x'*(f)- sr(t)+/iar(')cos»u c r-n A (j)Hno> r r, (11) 

nT £ r £ (n + 1 )7. 

The signal j'*(/) of (11) is now coherently demodulated by a 
carrier 2cosu> c r . We assume that the bandwidth of the receiving 
filters is wide enough so that no additional ISI distorts the signal 
The output y(t) of the demodulator is sampled every T seconds 
to produce at the n th signaling interval the in- phase sample 


r.(t) - ((e (/ )*rw (i )) 2 +- n^(r )} H . 


( 6 ) y<")“y('o)=FM»o)]cosX(io)+ (12) 

♦ Q K ('o)jsinX(ro)- 1 n* («o). 


r(0» V(n)h(tnT)-<- £ U(i)li(i-,T). (7) 

I if • 

and 


k(r ) - tan 


i ”«.(') 

' 0 ) t «,(')' 


( 8 ) 


The TWT is a nonlinear memoryless amplifier. It exhibits 
nonlinear distortion in both the amplitude and the phase. Using 
a quadrature model, the outpui j,(r) of the TWT can be 
expressed in the form 


) - f [r. (/ )]cos(o) t i + k(t ))-£?[ r. (r )]sin(«u c r + k(r )). (9) 


(o is an appropriately chosen sampling instant within the interval . 
nT £r £ (n + 1)7. 

Under the assumption of high available power at the earth 
stations, the effects of the uplink noise can be considered 
negligible. Thus we can assume that X(r)=0. Then y(n) of (12 » 
becomes 

y(n)= P[r(r 0 )) + n^(ro) O') 

From (7) and (13) an equivalent discrete-rime model for the 
communication channel of Figure 1 can be represented as in 
Figure 3. Now. with U(n) = {!.-]}, the basic relationships are 


From Saleh [8] an expression of P(r) and Q(r) is given by 


F(r)-o, 

and 


T +JP 7 


(10a) 


r(n) = A^*f^hiU(n-k) , 

P(")=F[r(„)]- T f^ T . 


(14) 

0-M 


g(r) " a * d+C t y (10b) 

The coefficients of (10) are obtained by b least-square error 
curve fining procedure of the specific TV<T characteristics. 


y(n) = P(n)+w(n) , (16) 

where o and (5 are specified constants that depend on the 
qsdfic type of the TWT. w(n) is white Gaussian noise of zero 
mfin and variance ai, and uncorrelated with the input data. 
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The values of N1 mod N2 represent the memory of the 
transmitting filter. The gain A depends on the specific operating 
paint of the TWT. 

m. THE MEAN-SQUARE ERROR EQUALIZER 

Let the receiver output z(n) be expressed as the output of a 
Tapped- Delay Line (TDL) filter in the form of 

,(n) ’,S. c ' > ' ( "'* ) ' (17) 


Computation ef the Autocorrelation coefficient 

The sequence {P(n)} at the output of the nonlinearity can be 
considered at the output of a finite Sate sequential machine. 
Since the nonlinearity hat no memory, from (14) the state 
sequence can be given by 

#(«)- (t/(a+ATl) U(n),U(n-\) f/(aWV2)] . (21) 

{U(d)) is an i.i.d sequence, thus (s(n)) is taelf a stationary 
Markov chain [9], 

Let ut denote by II the transition probability matrix of the 
Markov chain (s(n)). A brute force evaluation of R p {) involves 
multi plication of square matrices of dimension [9-10], 

which would be computational unpractical unless special 
consideration is given to the special properties of H. In [9] a 
particularly effective and simple algorithm for the evaluation of 
autocorrelation coefficients of a nonlinear system was presented. 
The algorithm, as applied to our specific problem is given below. 

Algorithm for the computation of R p (k ) 

1. Let N-N1+N2+1, and store in vector f)o (of dimension 
V ) the values at the output of the nonlinearity for each state s(j) 
cf(21),forj-U,...^'. 

2. Compute the vector o« ( of dimension V ), where the j tb 
component is given by 

«o(/)-e.0)/2*.>«l,2 V . 

3. For k-0,l,...,N-l, do the- following computations 

•>) 

b) Store in the first 2* - *' 1 positions of a*, the vector a 4 *i, 
computed by 


where from (16), y(n)=P(n)+w(n). 


o»*t0') = “tO ) + “tG +2*’-*- 1 ) , y — l,2,...,2 A,-< ' 1 . 


In the theory of the Mean-Squares criterion, the tap weight 
coefficients {r,} of the equalizer are adjusted to minimize the 
mean square error 

€»«E[y(«)-Jj? *>>-*)]». (18) 

Minimization of (16) with respect to the {c ; l coefficients, yields 
the linear system of M= Ml + M2-t 1 equations 

V c i R,(J k)'‘R*,U)- M2. (19) 

*•*1 


c) Store in the first 2" 40 positions of of the vector, the 
vector P* _ i . where 


P»*t0) = 


fi*(2y-i)+3*(2y) 

, y-l,z, 


2/v-ti 


4. *,(*) = 0, forla/V. 

The above algorithm is easy to implement and requires only two 
vectors of size 2 s as basic computation storage. 

Computation of Cross-correlations 


where R,(k)~ R,(-k ) = E[y(n)y(n-k)) and *„(*) - 

E(U(n)y(n-k)] for all values of k. 

From the uncorrelatedness of the input data and the noise. 
R^(k)-= R m (k ). lot all values of k Also, since the output P(n) 
of the nonlinearity, and the noise w(n) are independent 

R,(k)= R,(k)+alh «. (20) 


Since for each state s(j) of (21) the value of P(s(j)]= 3oO ) 
has already been computed for the evaluation of the R p {) 
coefficients, a brute force technique can be easily applied for 
the evaluation of the cross-correlation terms. Thus from [10], 

*_(-*) = (l/2*->) y /•[*(/)]. -NlsksN2, (22) 


where o; is the variance of the noise. Thus in order to solve 
(19) it is necessary to evaluate first the necessary R p () and 
Rqjf) coefficients. While in the case of a linear channel the 
calculation for the R p {) coefficients is straightforward, in the 
nonlinear rase the evaluation may present some numerical 
difficulties. 


where the summation in (22) is done over all those states where 
U(n-k) = 1. 

Id summary, the design procedure for the optimum linear 
MSE equalizer is given as follows. First, compute the 2 s 
possible values of P(n) at the output of the nonlinearity. Then 
use the algorithm to compute the R p {) coefficients and (22) to 
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compute the R^A) coefficients. Finally, «olve the linear lynem 
in (19). The solution at (19) yields the tap-weight coeffideott of 
the MSE receiver. 

IV. EVALUATION OF BIT-PROBABILITY OF ERROR 

Unfortunately, there is no ample relationship between the 
residual mean square error of the MSE receiver and the bit error 
probability (11]. Fes' moderate channel and equalizer memories, 
a brute force method that yields an exact result could be applied. 
Denote by D, one of the 2***' 1 possible realizations at the input 
erf the receiver, with U(n)- 1, and by c the M x 1 vector at the 
filter coefficients Then from (16) and (17), the receiver output 
due to a specific input (U(n)) sequence is given by 

xj -D/c + Wc , (23) 

where Wu«M*Ml+M2-fl row vector erf noise samples. 

Let (w(n)) be a white noise sequence erf zero mean and 
variance oi. Let n-Wc. Then £[n]«0 and (t$- o2 y e 
Then with U(n)- 1 and for a threshold of zero, the conditional 


arm probability 

P.(i ) « Pr\D,c +n«V{l/fO}] , (24) 

is fixed, and 

MO-CP/f/o,). (25) 

where 

£?(*)- (26) 
Then the average error probability P, is given by 

P, - (1/I)£ MO- . (27) 

l ■ 1 

In our numerical example the SNR is defined as 

SSR = 10 log 10 (P^/2o 2) , (28) 

where £«, = (1/i )^ / , [r«' )(0)) 


If the exaci error probability of (27) proves loo cumbersome 
and too time consuming to evaluate because of the large number 
of terms, one can resort to a number of different approximate 
methods that yield tight upper and lower bounds of P r [11]. 

V. NUMERICAL EXAMPLE 

The purpose of this section is to illustrate the application of 
out results in the design of a linear optima! receiver, and to 
compare its performance with other receivers for a digital 
satellite link. 

In our model of the linear part of the satellite link, we 
assume that the LSI is introduced by a 3-pole Burterworth filter. 
The rwo sided bandwidth B of the filter is the same as the 
minimum Nyquist rate (i.e.. BT— 3). The number of samples 
considered for the 1S1 is determined by those 1SI samples whose 
magnitude are at least greater than 0.01 times the main sample. 
In our example, a channel memory (N1-T-N2) of 3 ISI terms was 
considered adequate. 

The characteristics of the TWT for this study are the same as 
those in Figure 2. Thus in the evaluation of P(r(n)] in (15), the 
parameters of the TWT are a= 2 .0922 and fl= 1.2466. As 


mentioned before, those values were taken from S aleh Q8J , 
Figure 5) and r ep re sen t a specific satellite TWT. The gain factor 
A, erf (14) was determined so that with do 153 the TWT would 
operate at the 2 dB input backoff point. Because erf the kw ISI 
introduced by the transnisaoo filter, a 4-tap (M1+M2+1-4) 
TDL linear receiver was considered to be adequate. Thus, 
i-2<*~«)«64. 

Now we co mpare our optimum linearly equalized MSE 
receiver with the conventional linear receivers. Using the brute 
force technique described in Section 4, the bit error probability 
far the various recovers was evaluated and plotted in Figures 4 
and 5, for values of SNR as defined in (28). 

Figure 4 exhibits the P. performance at the designed MSE 
filter, and the P, performance of two 3- pole Butterworth 
receiving filters. One receiving filter (with BT- 1) is identical to 
the transmitting filter, while the other one has BT- 0.75. In 
Figure 5, the performance of the M.S receiver is compared with 
that of two 4- pole Butterworth receiving filters. One has 
BT-0.75 and the other one has BT-1. A numerical search 
proc e dure for Butterworth filters with different BT products, 
ihowed that an increase in BT does not n ece s sari ly correspond to 
an improved P, pe rfo rmance. In fact, filters with BT-2 are 
only marginall y better than filters with BT— 1, 



ncclrUg fllun. 


For P, = 10E-6, the optimum MSE receiver is at least 0.5 
dB better than the 3-pole Butterworth filters and 1.2 dB bene? 
than the 4-pole Butter-worth filters. The P, performance of a 
channel with oo ISI. but with the identical TWT, carrier power 
and noise variance was evaluated. The numertcal results showed 
tha t for these examples the bit error rate of the MSE equalizer is 
very close to that of the no IS! case (10) . 
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VI. SUMMARY AND CONCLUSIONS 

Is this paper we considered the problem of modeling and 
equalization of a digital satellite nonlinear and 
channel. Starting from a typical satellite link, we developed the 
corresponding BPSX discrete-time model, and solved for the 
optimum linear MSE receiver. A ample and computationally 
efficient algorithm was derived for the evaluation of the 
equalizer coefficients, based oo the memory! ess nonlinearity of 
the system. Numerical examples for a typical satellite link 
demonstrated that the optimum linear MSE receiver outperforms 
die conventional linear type receiving filters. In general, our 
modeling and equalization techniques provide a ample and 
computationally efficient alternative to existing approaches. 
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and Kalman Filtering by Systolic Arrays ^ 3 3 fS 
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M.J. Chen and K. Yao ^ 


l. introduction 


Least-squares (LS) estimation is a basic operation in many signal process 
ing problems. Given y=Ax+v, where A is a mxn coefficient matrix, y is a 
mxl observation vector, and v is a mxl zero mean white noise vector, a 
simple least-squares solution is finding x which minimizes ||Ax-y||. It is 
well known that for an ill-conditioned matrix A, solving least-squares 
problems by orthogonal triangular (QR) decomposition and back substitution 
has robust numerical properties under finite word length effect since 
2-norm is preserved. Many fast algorithms have been proposed and applied 
to systolic arrays. Gen tl eman-Kung (1981) first presented the triangular 
systolic array for a basic Givens reduction. MCWhirter (1983) used this 
array structure to find the least-squares estimation errors. Then by 
geometric approach, several different systolic array realizations of the 
recursive least-squares estimation algorithms of Lee et al (1981) were 
derived by Kalson-Yao (1985) . We consider basic QR decomposition algo 
ritttns and find that under one-row time updating situation, the House 
holder transformation degenerates to a simple Givens reduction. Next, we 
derive an improved least-squares estimation algorithm by considering a 
modified version of fast Givens reduction. From this approach, the basic 
relationship between Givens reduction and Modified-Gram-Schmidt transfor 
nation can easily be understood. We also can see this improved algorithm 
has simpler computational and inter-cell connection complexities while 
compared with other known least-squares algorithms and is more realistic 
for systolic array implementation. 

Minimum variance estimation (popularized by Kalman (1960) ) is the general 
ized form of a least-squares problem, where the state vector x is charac 
terized by the state equation x k+ . | =Fx k +w, the system noise w and the 
observation noise v are colored. The original algorithm presented by Kal 
man can have poor numerical property. Some algorithms for improving 
numerical properties, such as square-root covariance and square-root 
information methods have been studied. New, we find that after the whi 
tening processing, this minimum variance estimation can be formulated as 
the modified square- root information filter and be solved by the simple 
least-squares processing. This new approach contains advantages in both 
numerical accuracy as well as computational efficiency as compared to the 
original Kalman algorithm. Since all these processings can be implemented 
by systolic arrays, high throughput rate amputation for Kalman filtering 
problems become feasible. 



2. SIMPLE LEAST-SQUARES ESTIMATION 


Given the equation b=Ax+v, it is well known that we can solve the least- 
squares solution £ by normal equation. However, this approach not only 
requires tne computation of a matrix inverse but also doubles the condi 
tion number when we form A 'A. Although using singular value decomposition 
for least-squares solution can improve numerical properties, the computa 
tional complexity involved in SVD is not low. Besides, fast algorithm for 
SVD is still underdevelopment. Lattice structure for least-squares solu 
tion was proposed and studied by Lee et al (1981) . This approach was 
shown to have stable numerical property and regular hardware structure. 
However, this method required shifting property of the coefficient matrix 
and can not apply to all general cases. QR decomposition is another solu 
tion to obtain x, since 2-norm is preserved by multiplying an orthogonal 
matrix Q, then by letting QA=R be a upper triangular matrix, the x can be 
obtained by using back substitution for the equation Rx=fe. This approach 
has robust numerical properties since the 2-norm is fixed, the rounding 
error caused by finite word length effect will not grew. Basically, there 
are three ways for performing QR decomposition, namely, Householder trans 
formation, Givens reduction, and Modified -Gram- Schmidt orthogonalization. 
It can be shown that under one row time updating situation (as in the sys 
tolic array implementation) , the Householder transformation matrix will 
degenerate to a simple Givens reduction case. 

Systolic array implementation for QR decompositions in least-squares esti 
nation was first explored by Gentleman-Kung and followed by MCWhirter and 
Kalson-Yao. By using a triangular systolic array, it was shown that the 
estimation error for the last observation can be solved at every clock 
period. The systolic array structure for least-squares estimation is 
shown in Figure 2.1. To achieve fully pipelined operation, the input rows 
are skewed and propagated like wavefronts in the diagonal direction. There 
are only two basic processing units, boundary cell and internal cell, are 
required by this systolic array. Communication between different process 
ing units are all local. The properties of regularity and local ccnmuni 
cation are consistent with the philosophy of VLSI implementation. Summary 
of input/output formats and operation functions for two kinds of process 
ing units are shewn in Table 1 and Table 2 respectively. 


Table 1. Input/Output format of systolic array algorithms 



BI 1 

BI 2 

B°1 

BCL 

2 

n i 

M 

M 

ro 

M 

M° 

I0 2 

Givens 

0 

X 

o' 

d/d' ,x/d' 

d/d',x/d’ 

b d/d',x/d' 

b' 

F-Givens 

0 

X 

o' 

d/d',ax/d' 

X 

d/d ' , ax/d 1 

X 

b d/d',ox/d' 

X 

b’ 

M-G-S(I) 

0 

x,e 

a' 

x/d,x/(l-a) 

d 

x/d,x/(l-o) 

d 

b,e x/d,x/(l-a) 
d 

b*,e' 

M-G-S(II) 

0 

X 

a' 

x/d',x/(l-o) 

x/d' ,x/(l-o) 

b x/d',x/(l-o) 

b' 





The above symbols are for notations only, their physical meaning nay 
change for different algorithms. 

Table 2. Operational functions of processing units 


Boundary cells 

Internal cells 

Givens 

d'=(d 2 +x 2 ) 

BCL<— d/d' , x/d' 
o's (d/d')*o 

b'=(d/d')*b - (x/d')*k 
k'=(x/d')*b + (d/d')*k 

F-Givens 

d'=d + (o*x)*x 
BQ~< — (o*x)/d*, d/d* 
o's a* (d/d') 

b'=b - x*k 

k , =(d/d')*k + (°*x/d')*b 

tt-G-S(I) 

d=e 

BCL<— x/d, x/Q-a) 
o — o + (x/d) *x 

k’=k + b*(x/(l-o) ) 

b'=b - k'*(x/d) 
e'=e - k'Vd 

M-G-S(II) 

d'=d + (x/(l-a))*x 
BO-< — x/d', x/(l-a) 
o-o + (x/d')*x 

k'=k + b*(x/(l-o)) 
b’=b - k'*(x/d') 


Fran systolic array point of view, the difference between algorithms pro 
posed by MdVhirter and Kalson-Yao lies in the basic computations in two 
kinds of processing units. Since these algorithms were derived from two 
different approaches, specifically Givens reduction and Modif ied-Gram- 
Scftnidt orthogonal ization, the basic relationship for these two QR decompo 
sit ion methods under one rcw time updating can be compared as follows. 
First, we derived the modified expression for the fast -Givens reduction as 
given by 


(l//d)d, (l//d)dk 0 ,... (l//d)dk " 

2 k 

>x, > / o'b 2 , ... ► / o"b k 

(l/v / d')d',(l/v^d')d'k 2 ',...(l//d , )d'k k r 

0, *^’b 2 ’, ... ^’b k ’ 


the updating equation for this modif ied-fast-Givens algorithm becomes. 

Boundary cell: d'=d + x 2 /(l/o) (l/o') = (l/o) + x 2 /d 

Internal cell: b'=b - (x/d)*dk d'k'=dk + b*x/(l/o) [1] 


By comparing the computational complexity between the fast Givens algo 
rithm by Gentleman (1973) and that in [1] , we can see [1] has one multi 





plication less than the original algorithm. And since we do not have 
interest on the real rotated elements like (1/ v'Sjdk., we do not have the 
risk of dividing by a very small d. The numerical properties of the modi 
fied algorithm is then expected to comparable to the numerical properties 
of the original one. By equation [11, the basic duality associations 
between Givens reduction and Modified-Gram-Schmidt orthogonalization is 
summarized in Table 3, which allows us to derive different algorithms for 
least-squares estimation from different approaches with efficiency. 


Table 3. Duality association for M-G-S and Fast-Givens reduction. 
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With systolic array implementation, comparison of computational complexity 
for algorithms discussed above can be made by comparing the number of 
operations required in each processing unit. When the dimension of the 
coefficient matrix becomes large, wavefront array processing of Rung 
(1983) becomes more appropriate for the control scheme. In this case, the 
speed of this "wavefront" will be decided by the slowest processing unit 
along each wavefront. In modified fast Givens algorithm, equations for 
boundary cell are non-recursive and can be done in parallel if we can 
double the computational capability of each boundary cell. In this case, 
the wavefront speed and then the throughput rate can be doubled. The sys 
tolic array we discussed above will generate estimation error at each 
clock period. While the estimated vector & is not shown explicitly, St can 
be solved by back substitution which can be done by just appending a nxn 
identity matrix after the coefficient matrix A. 


3. MINIMUM VARIANCE ESTIMATIONS AND KALMAN FILTERING 


Often the signal vector x is a random process and can be modeled as a 
first order recursive equation. In this case, a first order recursive 
estimation (or Kalman filtering) problem can be stated as follows, 



[ 2 ] 


where F and C are time-varying coefficient matrices with dimension nxn and 
mxn respectively, w. is a nxl and v. is a raxl zero mean noise vectors 
with known covariance matrices W. ana v. respectively. It is assumed that 
noises w and v are uncorrelated and E[w?w.]=E[v.v.]=0 for all i^j. Under 
the minimum variance criterion, we want 1 ^ find x^ for all k, such that 
E|I(x, - x, ) 2 J| is minimized. Kalman showed that x k can be obtained by the 
recursive algorithm given as 
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The information matrix is defined as the inverse of the error covariance 
matrix P. Besides [3] , it is shown that instead of propagating the error 
covariance matrix, the Kalman filtering problem can be solved by propagat 
ing the information matrix during the iterations. Both covariance and 
information filters are recursive since the current updating depends only 
on results from previous stage. The choice between covariance filter and 
information filter depends on the values of n and m. When n>m, which is 
usually the case, the original Kalman filtering is chosen to avoid the 
inverse of the nxn matrix. However, Kalman algorithm is known for its 
poor numerical properties, especially for non-observable coefficient mat 
rices. The original Kalman filter needs an approximate 0(n J ) multiplica 
tion time for each iteration. If m>l, computation of a matrix inversion 
is inevitable. Since all equations are sequential in manner, if real time 
computation is required for a Kalman filtering problem, some modifications 
must be done to insure the capability for parallel computation. Among 
many possible modified algorithms, square-root filtering have been proved 
to have computational efficiency and robust numerical properties under 
finite word length effect (Kaminski 1971) . The main advantage of the 
square root filter is that we can handle the covariance matrix by its 
square root form which has condition number smaller than the original one. 
Therefore, for ill-conditioned problems, when we used the square root fil 
ter with a single precision machine, we can expect the same numerical 
result as if we have used the original algorithm on a double precision 
machine. Updating processings for both square root covariance filter and 
Square root information filter can be expressed in matrix forms and 
handled by the QR decomposition method which is capable of systolic array 
implementation. However, only square-root information filter allows us to 
update the estimated state vector as well as the information matrix by 
using the same transformation matrix Q. When both updated covariance mat 
rix and state vector are important to us, we find square-root information 
filter is a better solution for the systolic array implementation. The 
Square-root information filter requires computation of the inverse of the 
coefficient matrix F, which will cause bad numerical properties for F 
being near singular. Che version of the square root information matrix 
method for Kalman filtering was considered by Paige and Saunders (1977). 

It is shown that by using whitening processing through Cholesky decomposi 
tion, the Kalman filtering can be represented as a simple least-squares 
problem. This approach does not require the computation of the inverse of 
the matrix F and is more suitable for systolic array implementation. 


The whitening processing can be briefly described as below. Assume 
W=L,L ’ and V=L L ' are the Cholesky decomposition of covariance 
matrices w and ^.^_J*?ith W =L L ' and v’=L L it can be proved that 
L =L and L =L . w, =L 'w^ Snd v. =L 'v. v are whitened noises with 
iSen^ity covaYiaKce matrices. v 


Denote F=L 'F, C=L 'C, and =L 'y. . We can express the whitened 
System equations in the matrix-vector form as 
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get =lJt,...x k 3 by solving 14] as a LS problem, 
deccnposition to i41 at time ic, we have 


After 
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[5] 


We can see that R., i=l,2...k, in 15] are all upper triangular matrices, 
and x, , the optimum estimated vector at time k, depends only on the last 
line, i.e. , R x, =y,. Furthermore, at T=k+1, the updating equation 
depends on the last row of 153 only, Hiat is, the QR decomposition at 
T=k+1 only depends on a (2n+m)x(2n+l) matrix as in [63. When the QR 
decomposition of [63 is completed, we have K . (upper triangular) and 
y k+ ^ ready for iteration of next stage. 
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where * is the term used to compute the residue. 


Hie upper triangular matrix R. can be shown to be the square-root of the 
inverse of the error covariance matrix P. =E[ (x. -x. ) (x. -x. ) '] . Hiat is, 
this algorithm, which propagates the square root information matrix for 
next iteration, is actually a modified square-root information filtering. 


4. SYSTOLIC ARRAY IMPLEMENTATIONS FOR KALMAN FILTERING! 


From last section, we can see that the basic operations for square root 
Kalman filtering can be described in two parts. Hie first one, whitening 
processing includes operations such as Cholesky decomposition, inverse of 
triangular matrix, and matrix multiplication. Secondly, the QR deconposi 
tion is applied. Obviously, these two parts can be operated in parallel. 
Hiat is, we can start the whitening processing for the (k+l)st iteration 
as well as the QR decomposition for the k-th iteration at the same time in 
a pipelined manner. 

Hie original Bquare-root information filter involves the computation of 
the inverse of the coefficient matrix F which not only increases the com 
putational complexity but also causes bad numerical properties when coef 
ficient matrix F is singular or near singular. Hus shortcoming can be 



recovered by choosing the modified square root information filtering in 
[41. As shown from [4] -16], formulation of the modified square-root 
information filter involves only multiplication between coefficient mat 
rices and the inverse of the square root noise covariance matrices. For 
noise with positive definite covariance, square root covariance matrix 
always exists. 

4.1 Whitening Processing 

The whitening processing is done by multiplying the coefficient matrix 
with a whitening operator L* where (LL 1 ) -1 is the given covariance matrix 
of the additive noise. Since a covariance matrix is a positive definite 
symmetric matrix, the square root matrix can be obtained by the Cholesky 
decomposition. A triangular systolic array for Cholesky decomposition is 
designed for this purpose with outputs skewed to match the input format of 
the QR systolic array. 

The inversion of a upper triangular matrix is simple after we built the 
basic systolic array for QR decomposition. The idea for the inversion of 
a upper triangular matrix is the same as solving the back substitution. 

With UU 1= I, let U u 2 , ... u ], with jj. being a nxl column 

vector. A matrix inversion'can be divided into n sets of linear equa 
tions, each having the form of Uu.=£. , i=l,2,...n, where s.. is a nxl 
column vector with i^ 1 element equals to 1, and all othersDeing 0, and 
can be solved by a systolic array. 

4.2 QR Decomposition for Kalman Filtering 

Equation [6] suggests that X. can be solved as a least-squares solution by 
a 2nx2n QR_systolic array. However, serious delay will be caused by the 
fact that R. and R. + , are not in-place computations. That is, we have 
trouDle bo move the newly formed R from the upper-right corner to the low 
er-left corner in our triangular array for the next iteration. That is, 
the computation at stage k+l can not start until the last element of R. is 
completed. In this "waiting" period, most of processing units are idle 
and the pipeline is empty. It will cause delay for at least 2n clock 
periods. 

This disadvantage can be overcome by in-place computations for R. and 
R. + , . This can be done by partitioning the original matrix into K two 
strips, and perform the partitioned QR decomposition by the systolic array 
structure proposed in Figure 2. In this approach, a nxn QR systolic array 
as well as a rotation array which consists of nx (n+1) internal cells are 
used. Once elements of R^ are formed, it is ready to be used for 
computations at stage k+l. Here we need only to pass transformed elements 
generated by the first strip to the rectangular rotation array for the 
pre-processing of the second strip. This input format is shown in Figure 
3. Since all these can be done in fully pipelined manner and in-place 
computations are obtained, complicated inter-cell connection and control 
scheme can both be avoided. To obtain the estimated value , we can 
just append an identity matrix I after the second strip, and we get result 
every 3n+m clock periods. 



5. CONCLUSION 


In this paper, we first survey existing algorithms for least-squares esti 
nations by systolic arrays. Basic comparisons are made based on conputa 
tion and inter-cell connection complexities of elementary units. Finally, 
by choosing the square-root information filtering algorithm, we showed a 
simple way to solve the Kalman filtering as a least-squares problem that 
can be processed by systolic arrays. Systolic array for Cholesky deconpo 
sition is also proposed for whitening processing. By manipulating the 
data properly, the Kalman filtering can be processed under fully pipelined 
manner. There is no special constraint on our system equations and stan 
dard time-varying coefficient matrices and non-stationary colored noises 
are assumed in our model. Most of the processing units we need for this 
square root information filter do not involve square-root computations. 

The only exception is the computations for the Cholesky decomposition. 
However, for pipelined operation between whitening processing and QR 
decomposition, the later certainly involved more computational work than 
the former. Since there is only n square-root computations required in 
each iteration as compared with the operations required for QR deconposi 
tion, Cholesky decomposition will not become the bottleneck for this algo 
rithm. For many real life problems where we can assume noises are sta 
tionary, then covariance matrices W and V are fixed during our operation. 
In this case, inversed square-root covariance matrices can be obtained by 
pre-processing and our Kalman filtering can be solved as a simple least- 
square problem. Since all operations can be performed by the designed 
systolic array processing, which have the input/output formats matched to 
each other, the entire hardware design can be viewed as a pipelined struc 
ture. The estimated vector can be obtained with the O(n) in time while 
compared with the 0(n ) for the original Kalman filter. Finally, since 
this is a square root matrix operation, good numerical property can also 
be obtained. 
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Figure 1: Systolic array for least-squares estimation. 








Figure 3: Input format for systolic array Kalman filtering 








